Computation and Language 66
☆ I Could've Asked That: Reformulating Unanswerable Questions
When seeking information from unfamiliar documents, users frequently pose
questions that cannot be answered by the documents. While existing large
language models (LLMs) can identify these unanswerable questions, they do not
assist users in reformulating them, thereby reducing the models' overall
utility. We curate CouldAsk, an evaluation benchmark composed of existing and
new datasets for document-grounded question answering, specifically designed to
study reformulating unanswerable questions. We evaluate state-of-the-art
open-source and proprietary LLMs on CouldAsk. The results demonstrate the
limited capabilities of these models in reformulating questions. Specifically,
GPT-4 and Llama2-7B successfully reformulate questions only 26% and 12% of the
time, respectively. Error analysis shows that 62% of the unsuccessful
reformulations stem from the models merely rephrasing the questions or even
generating identical questions. We publicly release the benchmark and the code
to reproduce the experiments.
☆ WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries
Wenting Zhao, Tanya Goyal, Yu Ying Chiu, Liwei Jiang, Benjamin Newman, Abhilasha Ravichander, Khyathi Chandu, Ronan Le Bras, Claire Cardie, Yuntian Deng, Yejin Choi
While hallucinations of large language models (LLMs) prevail as a major
challenge, existing evaluation benchmarks on factuality do not cover the
diverse domains of knowledge that the real-world users of LLMs seek information
about. To bridge this gap, we introduce WildHallucinations, a benchmark that
evaluates factuality. It does so by prompting LLMs to generate information
about entities mined from user-chatbot conversations in the wild. These
generations are then automatically fact-checked against a systematically
curated knowledge source collected from web search. Notably, half of these
real-world entities do not have associated Wikipedia pages. We evaluate 118,785
generations from 15 LLMs on 7,919 entities. We find that LLMs consistently
hallucinate more on entities without Wikipedia pages and exhibit varying
hallucination rates across different domains. Finally, given the same base
models, adding a retrieval component only slightly reduces hallucinations but
does not eliminate them.
☆ CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models
Large Language Models (LLMs) excel in diverse tasks but often underperform in
specialized fields due to limited domain-specific or proprietary corpora.
Continual pre-training (CPT) enhances LLM capabilities by imbuing new
domain-specific or proprietary knowledge while replaying general corpus to
prevent catastrophic forgetting. The data mixture ratio of general corpus and
domain-specific corpus, however, has been chosen heuristically, leading to
sub-optimal training efficiency in practice. In this context, we revisit the
scaling behavior of LLMs under CPT and discover a power-law relationship
between loss, mixture ratio, and training token count.
We formalize the trade-off between general and domain-specific capabilities,
leading to a well-defined Critical Mixture Ratio (CMR) of general and domain
data. By striking the balance, CMR maintains the model's general ability and
achieves the desired domain transfer, ensuring the highest utilization of
available resources. Therefore, if we value the balance between efficiency and
effectiveness, CMR can be considered the optimal mixture ratio. Through
extensive experiments, we ascertain the predictability of CMR, propose the CMR
scaling law, and substantiate its generalization. These findings offer
practical guidelines for optimizing LLM training in specialized domains,
ensuring both general and domain-specific performance while efficiently
managing training resources.
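The power-law relationship between loss, mixture ratio, and token count described in this abstract can be illustrated with a toy fit. This is a minimal sketch under an assumed single-variable power law and illustrative variable names, not the paper's actual CMR scaling law: a power law becomes linear in log-log space, so its exponent can be recovered by ordinary least squares.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on log y = log a - b * log x."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - slope * mx)
    b = -slope
    return a, b

# Synthetic data generated from L = 5 * N**-0.3 (no noise), so the fit is exact.
tokens = [1e8, 1e9, 1e10, 1e11]
losses = [5 * t ** -0.3 for t in tokens]
a, b = fit_power_law(tokens, losses)
# a ≈ 5.0, b ≈ 0.3
```

Fitting in log-log space is the standard way such scaling-law exponents are estimated from (loss, token-count) pairs.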
☆ Fluent Student-Teacher Redteaming
Many publicly available language models have been safety tuned to reduce the
likelihood of toxic or liability-inducing text. Users or security analysts
attempt to jailbreak or redteam these models with adversarial prompts that
elicit compliance with requests. One attack method is to apply discrete
optimization techniques to the prompt. However, the resulting attack strings
are often gibberish text, easily filtered by defenders due to high measured
perplexity, and may fail for unseen tasks and/or well-tuned models. In this
work, we improve existing algorithms (primarily GCG and BEAST) to develop
powerful and fluent attacks on safety-tuned models like Llama-2 and Phi-3. Our
technique centers around a new distillation-based approach that encourages the
victim model to emulate a toxified finetune, either in terms of output
probabilities or internal activations. To encourage human-fluent attacks, we
add a multi-model perplexity penalty and a repetition penalty to the objective.
We also enhance optimizer strength by allowing token insertions, token swaps,
and token deletions and by using longer attack sequences. The resulting process
is able to reliably jailbreak the most difficult target models with prompts
that appear similar to human-written prompts. On Advbench we achieve attack
success rates $>93$% for Llama-2-7B, Llama-3-8B, and Vicuna-7B, while
maintaining model-measured perplexity $<33$; we achieve $95$% attack success
for Phi-3, though with higher perplexity. We also find a universally-optimized
single fluent prompt that induces $>88$% compliance on previously unseen tasks
across Llama-2-7B, Phi-3-mini and Vicuna-7B and transfers to other black-box
models.
☆ Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models
Syntactic Transformer language models aim to achieve better generalization
through simultaneously modeling syntax trees and sentences. While prior work
has been focusing on adding constituency-based structures to Transformers, we
introduce Dependency Transformer Grammars (DTGs), a new class of Transformer
language model with explicit dependency-based inductive bias. DTGs simulate
dependency transition systems with constrained attention patterns by modifying
attention masks, incorporate the stack information through relative positional
encoding, and augment dependency arc representation with a combination of token
embeddings and operation embeddings. When trained on a dataset of sentences
annotated with dependency trees, DTGs achieve better generalization while
maintaining perplexity comparable to Transformer language model baselines.
DTGs also outperform recent constituency-based models, showing that dependency
can better guide Transformer language models. Our code is released at
https://github.com/zhaoyd1/Dep_Transformer_Grammars.
☆ CovScore: Evaluation of Multi-Document Abstractive Title Set Generation
This paper introduces CovScore, an automatic reference-less methodology for
evaluating thematic title sets, extracted from a corpus of documents. While
such extraction methods are widely used, evaluating their effectiveness remains
an open question. Moreover, some existing practices heavily rely on slow and
laborious human annotation procedures. Inspired by recently introduced
LLM-based judge methods, we propose a novel methodology that decomposes quality
into five main metrics along different aspects of evaluation. This framing
simplifies and expedites the manual evaluation process and enables automatic
and independent LLM-based evaluation. As a test case, we apply our approach to
a corpus of Holocaust survivor testimonies, motivated both by its relevance to
title set extraction and by the moral significance of this pursuit. We validate
the methodology by applying it to naturalistic and synthetic title set
generation systems and comparing their performance.
☆ PERSONA: A Reproducible Testbed for Pluralistic Alignment
The rapid advancement of language models (LMs) necessitates robust alignment
with diverse user values. However, current preference optimization approaches
often fail to capture the plurality of user opinions, instead reinforcing
majority viewpoints and marginalizing minority perspectives. We introduce
PERSONA, a reproducible test bed designed to evaluate and improve pluralistic
alignment of LMs. We procedurally generate diverse user profiles from US census
data, resulting in 1,586 synthetic personas with varied demographic and
idiosyncratic attributes. We then generate a large-scale evaluation dataset
containing 3,868 prompts and 317,200 feedback pairs obtained from our synthetic
personas. Leveraging this dataset, we systematically evaluate LM capabilities
in role-playing diverse users, verified through human judges, and establish
both a benchmark, PERSONA Bench, for pluralistic alignment approaches and an
extensive dataset for creating new and future benchmarks.
The full dataset and benchmarks are available here:
https://www.synthlabs.ai/research/persona.
☆ A Comprehensive Approach to Misspelling Correction with BERT and Levenshtein Distance
Writing, as an omnipresent form of human communication, permeates nearly
every aspect of contemporary life. Consequently, inaccuracies or errors in
written communication can lead to profound consequences, ranging from financial
losses to potentially life-threatening situations. Spelling mistakes, among the
most prevalent writing errors, are frequently encountered due to various
factors. This research aims to identify and rectify diverse spelling errors in
text using neural networks, specifically leveraging the Bidirectional Encoder
Representations from Transformers (BERT) masked language model. To achieve this
goal, we compiled a comprehensive dataset encompassing both non-real-word and
real-word errors after categorizing different types of spelling mistakes.
Subsequently, multiple pre-trained BERT models were employed. To ensure optimal
performance in correcting misspelling errors, we propose a combined approach
utilizing the BERT masked language model and Levenshtein distance. The results
from our evaluation data demonstrate that the system presented herein exhibits
remarkable capabilities in identifying and rectifying spelling mistakes, often
surpassing existing systems tailored for the Persian language.
comment: 12 pages, 9 figures, 5 tables
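The combination of a masked language model with Levenshtein distance described in this abstract can be sketched as follows. The `mlm_candidates` dictionary is a hypothetical stand-in for actual BERT masked-LM predictions, and the tie-breaking scheme (edit distance first, LM probability second) is an assumption rather than the paper's exact scoring.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(misspelled: str, mlm_candidates: dict[str, float]) -> str:
    """Pick the masked-LM candidate closest (by edit distance) to the typo,
    breaking ties by the LM probability."""
    return min(mlm_candidates,
               key=lambda w: (levenshtein(misspelled, w), -mlm_candidates[w]))

# Mock masked-LM output for the masked token in "I went to the scool yesterday".
candidates = {"school": 0.62, "store": 0.21, "pool": 0.09, "zoo": 0.04}
print(correct("scool", candidates))  # → school
```

The edit-distance filter keeps the contextual model from replacing a typo with a fluent but orthographically unrelated word, which is the intuition behind combining the two signals.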
☆ MMRA: A Benchmark for Multi-granularity Multi-image Relational Association
Siwei Wu, Kang Zhu, Yu Bai, Yiming Liang, Yizhi Li, Haoning Wu, Jiaheng Liu, Ruibo Liu, Xingwei Qu, Xuxin Cheng, Ge Zhang, Wenhao Huang, Chenghua Lin
Given the remarkable success that large visual language models (LVLMs) have
achieved in image perception tasks, the endeavor to make LVLMs perceive the
world like humans is drawing increasing attention. Current multi-modal
benchmarks mainly focus on objective facts or topic-related knowledge within a
single image, but overlook the associative relations between multiple images.
Therefore, we define a multi-image relation association task, and meticulously
curate the \textbf{MMRA} benchmark, a \textbf{M}ulti-granularity
\textbf{M}ulti-image \textbf{R}elational \textbf{A}ssociation benchmark,
consisting of \textbf{1026} samples. In order to
systematically and comprehensively evaluate mainstream LVLMs, we establish an
associational relation system among images that contain \textbf{11 subtasks}
(e.g., UsageSimilarity, SubEvent, etc.) at two granularity levels (i.e.,
"\textbf{image}" and "\textbf{entity}") according to the relations in
ConceptNet. Our experiments demonstrate that, on our MMRA benchmark, current
mainstream LVLMs all have their own advantages and disadvantages across
different subtasks. It is worth noting that, at the entity level, all models
perform worse than at the image level, indicating that fine-grained multi-image
perception remains challenging for LVLMs. Tasks related to spatial perception
are relatively difficult for LVLMs to handle. Furthermore, we find that LVLMs
exhibit a good ability to perceive image details, and the key to enhancing
their multi-image association capability is to strengthen the reasoning
ability of their language model component. All our code and data are released
at \url{https://github.com/Wusiwei0410/MMRA}.
comment: VLMS, Multi-Image Association
☆ Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching CIKM 2024
With the introduction of large language models (LLMs), automatic math
reasoning has seen tremendous success. However, current methods primarily focus
on providing solutions or using techniques like Chain-of-Thought to enhance
problem-solving accuracy. In this paper, we focus on improving the capability
of mathematics teaching via a Socratic teaching-based LLM
(\texttt{SocraticLLM}), which guides learners toward profound thinking with
clarity and self-discovery via conversation. We collect and release a
high-quality mathematical teaching dataset, named \texttt{SocraticMATH}, which
provides Socratic-style conversations of problems with extra knowledge. Also,
we propose a knowledge-enhanced LLM as a strong baseline to generate reliable
responses with review, guidance/heuristic, rectification, and summarization.
Experimental results show the great advantages of \texttt{SocraticLLM} by
comparing it with several strong generative models. The codes and datasets are
available on \url{https://github.com/ECNU-ICALK/SocraticMath}.
comment: Accepted By CIKM 2024
☆ Label Alignment and Reassignment with Generalist Large Language Model for Enhanced Cross-Domain Named Entity Recognition
Named entity recognition in the in-domain supervised and few-shot settings
has been extensively discussed in the NLP community and has made significant
progress. However, cross-domain NER, a more common task in practical scenarios,
still poses a challenge for most NER methods. Previous research efforts in this
area primarily focus on knowledge transfer, such as correlating label
information from source to target domains, but few works address the problem of
label conflict. In this study, we introduce a label alignment and reassignment
approach, namely LAR, to address this issue for enhanced cross-domain named
entity recognition, which includes two core procedures: label alignment between
source and target domains and label reassignment for type inference. The
process of label reassignment can be significantly enhanced by integrating with
an advanced large-scale language model such as ChatGPT. We conduct an extensive
range of experiments on NER datasets involving both supervised and zero-shot
scenarios. Empirical results validate our method, which achieves remarkable
performance under the supervised and zero-shot out-of-domain settings compared
to SOTA methods.
comment: 9 pages, 4 figures
☆ How Good (Or Bad) Are LLMs at Detecting Misleading Visualizations? IEEE VIS 2024
In this study, we address the growing issue of misleading charts, a prevalent
problem that undermines the integrity of information dissemination. Misleading
charts can distort the viewer's perception of data, leading to
misinterpretations and decisions based on false information. The development of
effective automatic detection methods for misleading charts is an urgent field
of research. The recent advancement of multimodal Large Language Models (LLMs)
has introduced a promising direction for addressing this challenge. We explored
the capabilities of these models in analyzing complex charts and assessing the
impact of different prompting strategies on the models' analyses. We utilized a
dataset of misleading charts collected from the internet by prior research and
crafted nine distinct prompts, ranging from simple to complex, to test the
ability of four different multimodal LLMs in detecting over 21 different chart
issues. Through three experiments--from initial exploration to detailed
analysis--we progressively gained insights into how to effectively prompt LLMs
to identify misleading charts and developed strategies to address the
scalability challenges encountered as we expanded our detection range from the
initial five issues to 21 issues in the final experiment. Our findings reveal
that multimodal LLMs possess a strong capability for chart comprehension and
critical thinking in data interpretation. There is significant potential in
employing multimodal LLMs to counter misleading information by supporting
critical thinking and enhancing visualization literacy. This study demonstrates
the applicability of LLMs in addressing the pressing concern of misleading
charts.
comment: To be presented at IEEE VIS 2024
☆ Improving ICD coding using Chapter based Named Entities and Attentional Models
Recent advancements in natural language processing (NLP) have led to
automation in various domains. However, clinical NLP often relies on benchmark
datasets that may not reflect real-world scenarios accurately. Automatic ICD
coding, a vital NLP task, typically uses outdated and imbalanced datasets like
MIMIC-III, with existing methods yielding micro-averaged F1 scores between 0.4
and 0.7 due to many false positives. Our research introduces an enhanced
approach to ICD coding that improves F1 scores by using chapter-based named
entities and attentional models. This method categorizes discharge summaries
into ICD-9 Chapters and develops attentional models with chapter-specific data,
eliminating the need to consider external data for code identification. For
categorization, we use Chapter-IV to de-bias and influence key entities and
weights without neural networks, creating accurate thresholds and providing
interpretability for human validation. Post-validation, we develop attentional
models for three frequent and three non-frequent codes from Chapter-IV using
Bidirectional-Gated Recurrent Units (GRUs) with Attention and Transformer with
Multi-head Attention architectures. The average Micro-F1 scores of 0.79 and
0.81 from these models demonstrate significant performance improvements in ICD
coding.
comment: 10 Pages
☆ LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover
Recently, large language models have presented promising results in aiding
formal mathematical reasoning. However, their performance is restricted due to
the scarcity of formal theorem-proving data, which requires additional effort
to be extracted from raw formal language corpora. Meanwhile, a significant
amount of human-written formal language corpora remains underutilized. To
address this issue, we propose LEAN-GitHub, a dataset consisting of large-scale
formal data extracted from almost all Lean 4 repositories on GitHub. After
fine-tuning InternLM-math-plus on this dataset, our model achieved accuracies
of 48.8% with a single pass and 54.5% with 64 passes on the Lean 4 miniF2F
test, surpassing the state-of-the-art method at 52%. It also achieves
state-of-the-art on two other Lean 4 benchmarks (ProofNet and Putnam) targeting
different fields/levels of math. These results demonstrate that our proposed
dataset is beneficial for formal reasoning on a wide range of math topics. We
open-source our model at https://github.com/InternLM/InternLM-Math and our
data at https://huggingface.co/datasets/InternLM/Lean-GitHub.
☆ NarrationDep: Narratives on Social Media For Automatic Depression Detection
Social media posts provide valuable insight into the narrative of users and
their intentions, including providing an opportunity to automatically model
whether a social media user is depressed or not. The challenge lies in
faithfully modelling user narratives from their online social media posts,
which could potentially be useful in several different applications. We have
developed a novel and effective model called \texttt{NarrationDep}, which
focuses on detecting narratives associated with depression. By analyzing a
user's tweets, \texttt{NarrationDep} accurately identifies crucial narratives.
\texttt{NarrationDep} is a deep learning framework that jointly models
individual user tweet representations and clusters of users' tweets. As a
result, \texttt{NarrationDep} is characterized by a novel two-layer deep
learning model: the first layer models individual social media text posts, and
the second layer learns semantic representations of tweets associated with a
cluster. To faithfully model these cluster representations, the second layer
incorporates a novel component that hierarchically learns from users' posts.
The results demonstrate that our framework outperforms other comparative models
including recently developed models on a variety of datasets.
☆ Speech Editing -- a Summary
With the rise of video production and social media, speech editing has become
crucial for creators to address issues like mispronunciations, missing words,
or stuttering in audio recordings. This paper explores text-based speech
editing methods that modify audio via text transcripts without manual waveform
editing. These approaches ensure edited audio is indistinguishable from the
original by altering the mel-spectrogram. Recent advancements, such as
context-aware prosody correction and advanced attention mechanisms, have
improved speech editing quality. This paper reviews state-of-the-art methods,
compares key metrics, and examines widely used datasets. The aim is to
highlight ongoing issues and inspire further research and innovation in speech
editing.
☆ Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model
In this paper, we experimented with the SpeechT5 model pre-trained on
large-scale datasets. We pre-trained the foundation model from scratch and
fine-tuned it on a large-scale robust multi-speaker text-to-speech (TTS) task.
We tested the model capabilities in a zero- and few-shot scenario. Based on two
listening tests, we evaluated the synthetic audio quality and the similarity of
how synthetic voices resemble real voices. Our results showed that the SpeechT5
model can generate a synthetic voice for any speaker using only one minute of
the target speaker's data. We successfully demonstrated the high quality and
similarity of our synthetic voices on publicly known Czech politicians and
celebrities.
comment: Accepted to TSD2024
☆ A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives INTERSPEECH2024
In this paper, we are comparing monolingual Wav2Vec 2.0 models with various
multilingual models to see whether we could improve speech recognition
performance on a unique oral history archive containing a lot of mixed-language
sentences. Our main goal is to push forward research on this unique dataset,
which is an extremely valuable part of our cultural heritage. Our results
suggest that monolingual speech recognition models are, in most cases, superior
to multilingual models, even when processing the oral history archive full of
mixed-language sentences from non-native speakers. We also performed the same
experiments on the public CommonVoice dataset to verify our results. We are
contributing to the research community by releasing our pre-trained models to
the public.
comment: Accepted to INTERSPEECH2024
☆ SimCT: A Simple Consistency Test Protocol in LLMs Development Lifecycle
In this work, we report our efforts to advance the standard operation
procedure of developing Large Language Models (LLMs) or LLMs-based systems or
services in industry. We introduce the concept of Large Language Model
Development Lifecycle (LDLC) and then highlight the importance of consistency
testing in ensuring delivery quality. A principled solution for consistency
testing, however, is usually overlooked by industrial practitioners and not
considered urgent in academia, and current practical solutions are
insufficiently rigorous and labor-intensive. We thus propose a simple yet
effective consistency test protocol, named SimCT. SimCT proactively checks the
consistency
across different development stages of "bare metal" LLMs or associated services
without accessing the model artifacts, in an attempt to expedite the delivery
by reducing the back-and-forth alignment communications among multiple teams
involved in different development stages.
Specifically, SimCT encompasses response-wise and model-wise tests. We
implement the protocol with LightGBM and Student's t-test for two components
respectively, and perform extensive experiments to substantiate the
effectiveness of SimCT and the involved components.
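SimCT's model-wise test relies on Student's t-test; a minimal sketch of such a two-sample comparison is shown below, with hypothetical per-prompt stage scores and a pooled-variance statistic. The exact test setup, metrics, and thresholds used in SimCT may differ.

```python
import math
from statistics import mean, variance

def two_sample_t(xs: list[float], ys: list[float]) -> float:
    """Student's two-sample t statistic with pooled variance, as one way to
    compare a metric across two development stages of the same model."""
    nx, ny = len(xs), len(ys)
    sp2 = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

# Hypothetical per-prompt scores from stage A and stage B of the same service.
stage_a = [0.81, 0.79, 0.83, 0.80, 0.82]
stage_b = [0.80, 0.78, 0.84, 0.81, 0.79]
t = two_sample_t(stage_a, stage_b)
# |t| well below ~2.306 (the two-sided 5% critical value for 8 d.o.f.),
# so the two stages look consistent under this toy check.
```

A small |t| means the observed difference between stages is within sampling noise, i.e. no evidence of a consistency regression between them.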
☆ SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH)
Bernardo Consoli, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding
Extracting social determinants of health (SDoH) from unstructured medical
notes depends heavily on labor-intensive annotations, which are typically
task-specific, hampering reusability and limiting sharing. In this study we
introduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM)
method leveraging contrastive examples and concise instructions to extract SDoH
without relying on extensive medical annotations or costly human intervention.
It achieved tenfold and twentyfold reductions in time and cost respectively,
and superior consistency with human annotators measured by Cohen's kappa of up
to 0.92. The innovative combination of SDoH-GPT and XGBoost leverages the
strengths of both, ensuring high accuracy and computational efficiency while
consistently maintaining 0.90+ AUROC scores. Testing across three distinct
datasets has confirmed its robustness and accuracy. This study highlights the
potential of leveraging LLMs to revolutionize medical note classification,
demonstrating their capability to achieve highly accurate classifications with
significantly reduced time and cost.
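Cohen's kappa, the agreement measure cited in this abstract, corrects raw annotator agreement for the agreement expected by chance. A minimal self-contained sketch with hypothetical SDoH labels (the label set and notes below are illustrative, not from the study):

```python
def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: observed agreement between two annotators, corrected
    for the agreement expected by chance from each rater's label frequencies."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    return (observed - expected) / (1 - expected)

# Toy SDoH labels for ten notes, comparing an LLM annotator to a human.
llm =   ["housing", "none", "employment", "none", "housing",
         "none", "employment", "none", "none", "housing"]
human = ["housing", "none", "employment", "none", "housing",
         "none", "employment", "housing", "none", "housing"]
print(round(cohens_kappa(llm, human), 2))  # → 0.84
```

Kappa values above ~0.8 are conventionally read as near-perfect agreement, which is why a kappa of up to 0.92 against human annotators is a strong result.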
☆ Behavioral Testing: Can Large Language Models Implicitly Resolve Ambiguous Entities?
One of the major aspects contributing to the striking performance of large
language models (LLMs) is the vast amount of factual knowledge accumulated
during pre-training. Yet, many LLMs suffer from self-inconsistency, which
raises doubts about their trustworthiness and reliability. In this paper, we
focus on entity type ambiguity and analyze current state-of-the-art LLMs for
their proficiency and consistency in applying their factual knowledge when
prompted for entities under ambiguity. To do so, we propose an evaluation
protocol that disentangles knowing from applying knowledge, and test
state-of-the-art LLMs on 49 entities. Our experiments reveal that LLMs perform
poorly with ambiguous prompts, achieving only 80% accuracy. Our results further
demonstrate systematic discrepancies in LLM behavior and their failure to
consistently apply information, indicating that the models can exhibit
knowledge without being able to utilize it, show significant biases toward
preferred readings, and display self-inconsistencies. Our study highlights the
importance of handling entity ambiguity in the future for more trustworthy
LLMs.
☆ A Survey Forest Diagram : Gain a Divergent Insight View on a Specific Research Topic
With the exponential growth in the number of papers and the trend of AI
research, the use of Generative AI for information retrieval and
question-answering has become popular for conducting research surveys. However,
novice researchers unfamiliar with a particular field may not significantly
improve their efficiency in interacting with Generative AI because they have
not developed divergent thinking in that field. This study aims to develop an
in-depth Survey Forest Diagram that guides novice researchers in divergent
thinking about the research topic by indicating the citation clues among
multiple papers, to help expand the survey perspective for novice researchers.
comment: This paper will submit to IEEE SMC 2024
☆ SAFETY-J: Evaluating Safety with Critique
The deployment of Large Language Models (LLMs) in content generation raises
significant safety concerns, particularly regarding the transparency and
interpretability of content evaluations. Current methods, primarily focused on
binary safety classifications, lack mechanisms for detailed critique, limiting
their utility for model improvement and user trust. To address these
limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for
English and Chinese with critique-based judgment. SAFETY-J utilizes a robust
training dataset that includes diverse dialogues and augmented query-response
pairs to assess safety across various scenarios comprehensively. We establish
an automated meta-evaluation benchmark that objectively assesses the quality of
critiques with minimal human intervention, facilitating scalable and continuous
improvement. Additionally, SAFETY-J employs an iterative preference learning
technique to dynamically refine safety assessments based on meta-evaluations
and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced
and accurate safety evaluations, thereby enhancing both critique quality and
predictive reliability in complex content scenarios. To facilitate further
research and application, we will open-source SAFETY-J's training protocols,
datasets, and code.
☆ High Efficiency Image Compression for Large Visual-Language Models
In recent years, large visual language models (LVLMs) have shown impressive
performance and promising generalization capability in multi-modal tasks, thus
replacing humans as receivers of visual information in various application
scenarios. In this paper, we pioneer a variable bitrate image compression
framework consisting of a pre-editing module and an end-to-end codec to achieve
promising rate-accuracy performance for different LVLMs. In
particular, instead of optimizing an adaptive pre-editing network towards a
particular task or several representative tasks, we propose a new optimization
strategy tailored for LVLMs, which is designed based on the representation and
discrimination capability with token-level distortion and rank. The pre-editing
module and the variable bitrate end-to-end image codec are jointly trained by
the losses based on semantic tokens of the large model, which introduce
enhanced generalization capability for various data and tasks. Experimental
results demonstrate that the proposed framework can efficiently achieve much
better rate-accuracy performance than the state-of-the-art coding standard,
Versatile Video Coding. Meanwhile, experiments with multi-modal
tasks have revealed the robustness and generalization capability of the
proposed framework.
☆ From Internal Conflict to Contextual Adaptation of Language Models
Knowledge-intensive language understanding tasks require Language Models
(LMs) to integrate relevant context, mitigating their inherent weaknesses, such
as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs
often ignore the provided context as it can conflict with the LM's
pre-existing memory learned during pre-training. Moreover, conflicting
knowledge can already
be present in the LM's parameters, termed intra-memory conflict. Existing works
have studied the two types of knowledge conflicts only in isolation. We
conjecture that the (degree of) intra-memory conflicts can in turn affect LM's
handling of context-memory conflicts. To study this, we introduce the
DYNAMICQA dataset, which includes temporally dynamic facts, which change with
varying frequency, and disputable facts, which can change depending on the
viewpoint. DYNAMICQA is the first to include real-world
knowledge conflicts and provide context to study the link between the different
types of knowledge conflicts. With the proposed dataset, we assess the use of
uncertainty for measuring the intra-memory conflict and introduce a novel
Coherent Persuasion (CP) score to evaluate the context's ability to sway LM's
semantic output. Our extensive experiments reveal that static facts, which are
unlikely to change, are more easily updated with additional context, relative
to temporal and disputable facts.
comment: 22 pages, 15 figures
☆ Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education
Large language model (LLM)-based evaluation pipelines have demonstrated their
capability to robustly evaluate machine-generated text. Extending this
methodology to assess human-written text could significantly benefit
educational settings by providing direct feedback to enhance writing skills,
although this application is not straightforward. In this paper, we investigate
whether LLMs can effectively assess human-written text for educational
purposes. We collected 100 texts from 32 Korean students across 15 types of
writing and employed GPT-4-Turbo to evaluate them using grammaticality,
fluency, coherence, consistency, and relevance as criteria. Our analyses
indicate that LLM evaluators can reliably assess grammaticality and fluency, as
well as more objective types of writing, though they struggle with other
criteria and types of writing. We publicly release our dataset and feedback.
comment: Work In Progress
☆ Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism
Large language models (LLMs) exhibit remarkable in-context learning (ICL)
capabilities. However, the underlying working mechanism of ICL remains poorly
understood. Recent research presents two conflicting views on ICL: one
attributes it to LLMs' inherent ability to recognize the task, deeming label
correctness and the number of demonstration shots not crucial; the other
emphasizes the impact of similar examples in the demonstrations, stressing the
need for label correctness and more shots. In this work, we provide a
Two-Dimensional Coordinate System that unifies both views into a systematic
framework. The framework explains the behavior of ICL through two orthogonal
variables: whether LLMs can recognize the task and whether similar examples are
presented in the demonstrations. We propose the peak inverse rank metric to
detect the task recognition ability of LLMs and study LLMs' reactions to
different definitions of similarity. Based on these, we conduct extensive
experiments to elucidate how ICL functions across each quadrant on multiple
representative classification tasks. Finally, we extend our analyses to
generation tasks, showing that our coordinate system can also be used to
interpret ICL for generation tasks effectively.
☆ Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective
This paper investigates Who's Harry Potter (WHP), a pioneering yet
insufficiently understood method for LLM unlearning. We explore it in two
steps. First, we introduce a new task of LLM targeted unlearning, where given
an unlearning target (e.g., a person) and some unlearning documents, we aim to
unlearn only the information about the target, rather than everything in the
unlearning documents. We further argue that a successful unlearning should
satisfy criteria such as not outputting gibberish, not fabricating facts about
the unlearning target, and not releasing factual information under jailbreak
attacks. Second, we construct a causal intervention framework for targeted
unlearning, where the knowledge of the unlearning target is modeled as a
confounder between LLM input and output, and the unlearning process as a
deconfounding process. This framework justifies and extends WHP, deriving a
simple unlearning algorithm that includes WHP as a special case. Experiments on
existing and new datasets show that our approach, without explicitly optimizing
for the aforementioned criteria, achieves competitive performance in all of
them. Our code is available at
https://github.com/UCSB-NLP-Chang/causal_unlearn.git.
☆ A Voter-Based Stochastic Rejection-Method Framework for Asymptotically Safe Language Model Outputs
This paper proposes a new method for preventing unsafe or otherwise
low-quality large language model (LLM) outputs by leveraging the stochasticity of
LLMs. We propose a system whereby LLM checkers vote on the acceptability of a
generated output, regenerating it if a threshold of disapproval is reached,
until sufficient checkers approve. We further propose estimators for cost and
failure rate, and based on those estimators and experimental data tailored to
the application, we propose an algorithm that achieves a desired failure rate
at the least possible cost. We demonstrate that, under these models, failure
rate decreases exponentially as a function of cost when voter count and
threshold are chosen according to the algorithm, and that the models reasonably
estimate the actual performance of such a system in action, even with limited
data.
comment: 7 pages, 2 figures
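The regenerate-until-approved loop described in this abstract can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation; the function and parameter names are my own.

```python
def generate_safely(generate, checkers, disapproval_threshold, max_rounds=100):
    """Regenerate until fewer than `disapproval_threshold` checkers disapprove.

    `generate` produces a candidate output; each checker returns True if it
    approves the candidate. All names here are illustrative, not the paper's
    interface.
    """
    for _ in range(max_rounds):
        candidate = generate()
        disapprovals = sum(1 for check in checkers if not check(candidate))
        if disapprovals < disapproval_threshold:
            return candidate
    raise RuntimeError("no candidate approved within max_rounds")

# Deterministic toy run: the first two candidates are rejected by all three
# checkers, the third is unanimously approved and returned.
outputs = iter(["bad", "bad", "good"])
checkers = [lambda o: o == "good"] * 3
accepted = generate_safely(lambda: next(outputs), checkers, disapproval_threshold=2)
print(accepted)  # good
```

With independent checkers, each extra voting round multiplies the chance of an unsafe output slipping through, which is the intuition behind the exponential failure-rate decay the paper reports.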
☆ Towards Aligning Language Models with Textual Feedback
We present ALT (ALignment with Textual feedback), an approach that aligns
language models with user preferences expressed in text. We argue that text
offers greater expressiveness, enabling users to provide richer feedback than
simple comparative preferences, and that this richer feedback can lead to more
efficient and effective alignment. ALT aligns the model by conditioning its
generation on the textual feedback. Our method relies solely on language
modeling techniques and requires minimal hyper-parameter tuning, though it
still presents the main benefits of RL-based alignment algorithms and can
effectively learn from textual feedback. We explore the efficacy and efficiency
of textual feedback across different tasks such as toxicity reduction,
summarization, and dialog response generation. We find that ALT outperforms PPO
for the task of toxicity reduction while being able to match its performance on
summarization with only 20% of the samples. We also explore how ALT can be used
with feedback provided by an existing LLM, examining both constrained and
unconstrained textual feedback. Finally, we outline future
directions to align models with natural language feedback.
☆ Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation
Large language models (LLMs) often inherit biases from vast amounts of
training corpora. Traditional debiasing methods, while effective to some
extent, do not completely eliminate memorized biases and toxicity in LLMs. In
this paper, we study an unlearning-based approach to debiasing in LLMs by
performing gradient ascent on hate speech against minority groups, i.e.,
minimizing the likelihood of biased or toxic content. Specifically, we propose
a masked language modeling unlearning technique, which unlearns the harmful part
of the text. This method enables LLMs to selectively forget and disassociate
from biased and harmful content. Experimental results demonstrate the
effectiveness of our approach in diminishing bias while maintaining the
language modeling abilities. Surprisingly, the results also unveil an
unexpected potential for cross-domain transfer unlearning: debiasing in one
bias form (e.g., gender) may contribute to mitigating others (e.g., race and
religion).
☆ Early screening of potential breakthrough technologies with enhanced interpretability: A patent-specific hierarchical attention network model
Despite the usefulness of machine learning approaches for the early screening
of potential breakthrough technologies, their practicality is often hindered by
opaque models. To address this, we propose an interpretable machine learning
approach to predicting future citation counts from patent texts using a
patent-specific hierarchical attention network (PatentHAN) model. Central to
this approach are (1) a patent-specific pre-trained language model, capturing
the meanings of technical words in patent claims, (2) a hierarchical network
structure, enabling detailed analysis at the claim level, and (3) a claim-wise
self-attention mechanism, revealing pivotal claims during the screening
process. A case study of 35,376 pharmaceutical patents demonstrates the
effectiveness of our approach in early screening of potential breakthrough
technologies while ensuring interpretability. Furthermore, we conduct
additional analyses using different language models and claim types to examine
the robustness of the approach. It is expected that the proposed approach will
enhance expert-machine collaboration in identifying breakthrough technologies,
providing new, text-mining-derived insight into technological value.
☆ ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering
Xiuying Chen, Tairan Wang, Taicheng Guo, Kehan Guo, Juexiao Zhou, Haoyang Li, Mingchen Zhuge, Jürgen Schmidhuber, Xin Gao, Xiangliang Zhang
Question Answering (QA) effectively evaluates language models' reasoning and
knowledge depth. While QA datasets are plentiful in areas like general domain
and biomedicine, academic chemistry is less explored. Chemical QA plays a
crucial role in both education and research by effectively translating complex
chemical information into a readily understandable format. Addressing this gap,
we introduce ScholarChemQA, a large-scale QA dataset constructed from chemical
papers. This dataset reflects typical real-world challenges, including an
imbalanced data distribution and a substantial amount of unlabeled data that
can be potentially useful. Correspondingly, we introduce a QAMatch model,
specifically designed to effectively answer chemical questions by fully
leveraging our collected data. We first address the issue of imbalanced label
distribution by re-weighting the instance-wise loss based on the inverse
frequency of each class, ensuring minority classes are not dominated by
majority ones during optimization. Next, we utilize the unlabeled data to
enrich the learning process, generating a variety of augmentations based on a
SoftMix operation and ensuring their predictions align with the same target,
i.e., pseudo-labels. To ensure the quality of the pseudo-labels, we propose a
calibration procedure aimed at closely aligning the pseudo-label estimates of
individual samples with a desired ground truth distribution. Experiments show
that our QAMatch significantly outperforms the recent similar-scale baselines
and Large Language Models (LLMs) not only on our ScholarChemQA dataset but also
on four benchmark datasets. We hope our benchmark and model can facilitate and
promote more research on chemical QA.
comment: 14 pages
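The inverse-frequency loss re-weighting step described in the QAMatch abstract is a standard technique and can be sketched as follows. This is a generic formulation, assuming weights normalized to an average of 1 over instances; the paper's exact scheme may differ.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to inverse class frequency,
    normalized so the average weight over instances is 1. A generic
    re-weighting sketch, not the paper's exact formulation."""
    counts = Counter(labels)
    n = len(labels)
    raw = {c: n / k for c, k in counts.items()}   # 1 / relative class frequency
    z = sum(raw[y] for y in labels) / n           # normalizer: mean raw weight
    return {c: w / z for c, w in raw.items()}

# Imbalanced toy label set: 8 "yes" vs 2 "no".
labels = ["yes"] * 8 + ["no"] * 2
w = inverse_frequency_weights(labels)
print(w["no"] / w["yes"])  # 4.0: the minority class gets a 4x larger weight
```

Multiplying each instance's loss by `w[label]` keeps the minority class from being dominated during optimization, as the abstract describes.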
☆ Train-Attention: Meta-Learning Where to Focus in Continual Knowledge Learning
Previous studies on continual knowledge learning (CKL) in large language
models (LLMs) have predominantly focused on approaches such as regularization,
architectural modifications, and rehearsal techniques to mitigate catastrophic
forgetting. However, these methods naively inherit the inefficiencies of
standard training procedures, indiscriminately applying uniform weight across
all tokens, which can lead to unnecessary parameter updates and increased
forgetting. To address these shortcomings, we propose a novel CKL approach
termed Train-Attention-Augmented Language Model (TAALM), which enhances
learning efficiency by dynamically predicting and applying weights to tokens
based on their usefulness. This method employs a meta-learning framework that
optimizes token importance predictions, facilitating targeted knowledge updates
and minimizing forgetting. We also observe that existing benchmarks do not
clearly exhibit the trade-off between learning and retaining; we therefore
propose a new benchmark, \textsc{LAMA-ckl}, to address this issue. Through
experiments conducted on both newly introduced and established CKL benchmarks,
TAALM achieves state-of-the-art performance over the baselines, and also
shows synergistic compatibility when integrated with previous CKL approaches.
♻ ☆ A Unified Framework for Model Editing ACL 2024
ROME and MEMIT are largely believed to be two different model editing
algorithms, with the major difference between them being the ability to perform
batched edits. In this paper, we unify these two algorithms under a single
conceptual umbrella, optimizing for the same goal, which we call the
preservation-memorization objective. ROME uses an equality constraint to
optimize this objective to perform one edit at a time, whereas MEMIT employs a
more flexible least-squares constraint that allows for batched edits. We
generalize ROME and enable batched editing under an equality constraint in the
form of EMMET (an Equality-constrained Mass Model Editing algorithm for
Transformers), a new batched memory-editing algorithm. EMMET can perform
batched edits up to a batch size of 10,000, with very similar performance to
MEMIT across multiple dimensions. With the introduction of EMMET, we truly
unify ROME and MEMIT and show that both algorithms are equivalent in terms of
their optimization objective, their abilities (singular and batched editing),
their model editing performance and their limitations.
comment: Under review. To appear as poster at KnowledgeableLM Workshop
co-located with ACL 2024
♻ ☆ Dissecting Language Models: Machine Unlearning via Selective Pruning
Understanding and shaping the behaviour of Large Language Models (LLMs) is
increasingly important as applications become more powerful and more frequently
adopted. This paper introduces a machine unlearning method specifically
designed for LLMs. We introduce a selective pruning method for LLMs that
removes neurons based on their relative importance on a targeted capability
compared to overall network performance. This approach is a compute- and
data-efficient method for identifying and removing neurons that enable specific
behaviours. Our findings reveal that both feed-forward and attention neurons in
LLMs are specialized; that is, for specific tasks, certain neurons are more
crucial than others. Code from all experiments is available at
https://github.com/nickypro/selective-pruning
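The relative-importance criterion in the selective-pruning abstract can be sketched as a ratio of activation statistics. This is a schematic using mean activation magnitude as the importance measure; the paper's exact statistic may differ.

```python
import numpy as np

def neurons_to_prune(task_acts, general_acts, k):
    """Rank neurons by relative importance: mean activation magnitude on the
    targeted capability divided by that on general data, and return the k
    highest-scoring neuron indices. A schematic of selective pruning, not the
    authors' exact criterion.

    task_acts, general_acts: (num_samples, num_neurons) activation matrices."""
    eps = 1e-8
    task_imp = np.abs(task_acts).mean(axis=0)
    general_imp = np.abs(general_acts).mean(axis=0) + eps
    score = task_imp / general_imp
    return np.argsort(score)[-k:]

# Toy setup: neurons 1 and 4 fire much more strongly on the targeted task
# than on general data, so they are the ones selected for removal.
rng = np.random.default_rng(0)
general = rng.normal(size=(100, 6))
task = general.copy()
task[:, [1, 4]] *= 5.0
print(sorted(neurons_to_prune(task, general, k=2)))  # [1, 4]
```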
♻ ☆ Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland
General-purpose artificial intelligence (AI) systems are built on massive
swathes of public web data, assembled into corpora such as C4, RefinedWeb, and
Dolma. To our knowledge, we conduct the first large-scale, longitudinal audit
of the consent protocols for the web domains underlying AI training corpora.
Our audit of 14,000 web domains provides an expansive view of crawlable web
data and how codified data use preferences are changing over time. We observe a
proliferation of AI-specific clauses to limit use, acute differences in
restrictions on AI developers, as well as general inconsistencies between
websites' expressed intentions in their Terms of Service and their robots.txt.
We diagnose these as symptoms of ineffective web protocols, not designed to
cope with the widespread re-purposing of the internet for AI. Our longitudinal
analyses show that in a single year (2023-2024) there has been a rapid
crescendo of data restrictions from web sources, rendering ~5%+ of all tokens
in C4, or 28%+ of the most actively maintained, critical sources in C4, fully
restricted from use. For Terms of Service crawling restrictions, a full 45% of
C4 is now restricted. If respected or enforced, these restrictions are rapidly
biasing the diversity, freshness, and scaling laws for general-purpose AI
systems. We hope to illustrate the emerging crises in data consent, for both
developers and creators. The foreclosure of much of the open web will impact
not only commercial AI, but also non-commercial AI and academic research.
comment: 41 pages (13 main), 5 figures, 9 tables
♻ ☆ How Easily do Irrelevant Inputs Skew the Responses of Large Language Models?
By leveraging the retrieval of information from external knowledge databases,
Large Language Models (LLMs) exhibit enhanced capabilities for accomplishing
many knowledge-intensive tasks. However, due to the inherent flaws of current
retrieval systems, irrelevant information may appear among the retrieved
top-ranked passages. In this work, we present a comprehensive
investigation into the robustness of LLMs to different types of irrelevant
information under various conditions. We initially introduce a framework to
construct high-quality irrelevant information ranging from semantically
unrelated to partially related and related to the question. Furthermore, our
analysis demonstrates that the constructed irrelevant information not only
scores highly on similarity metrics, being highly retrieved by existing
systems, but also bears semantic connections to the context. Our investigation
reveals that current LLMs still face challenges in discriminating highly
semantically related information and can be easily distracted by this
irrelevant yet misleading content. Moreover, we find that current solutions
for handling irrelevant information have limitations in improving the
robustness of LLMs to such distractions. All the resources are available on
GitHub at https://github.com/Di-viner/LLM-Robustness-to-Irrelevant-Information.
comment: COLM 2024
♻ ☆ MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms ACL 2024
Social media platforms are hubs for multimodal information exchange,
encompassing text, images, and videos, making it challenging for machines to
comprehend the information or emotions associated with interactions in online
spaces. Multimodal Large Language Models (MLLMs) have emerged as a promising
solution to these challenges, yet they struggle to accurately interpret human
emotions and complex content such as misinformation. This paper introduces
MM-Soc, a comprehensive benchmark designed to evaluate MLLMs' understanding of
multimodal social media content. MM-Soc compiles prominent multimodal datasets
and incorporates a novel large-scale YouTube tagging dataset, targeting tasks
ranging from misinformation detection and hate speech detection to social
context generation. Through our exhaustive evaluation on ten size variants of
four open-source MLLMs, we have identified significant performance disparities,
highlighting the need for advancements in models' social understanding
capabilities. Our analysis reveals that, in a zero-shot setting, various types
of MLLMs generally exhibit difficulties in handling social media tasks.
However, MLLMs demonstrate performance improvements post fine-tuning,
suggesting potential pathways for improvement. Our code and data are available
at https://github.com/claws-lab/MMSoc.git.
comment: In Proceedings of ACL 2024
♻ ☆ AMONGAGENTS: Evaluating Large Language Models in the Interactive Text-Based Social Deduction Game ACL 2024
Strategic social deduction games serve as valuable testbeds for evaluating
the understanding and inference skills of language models, offering crucial
insights into social science, artificial intelligence, and strategic gaming.
This paper focuses on creating proxies of human behavior in simulated
environments, with Among Us utilized as a tool for studying simulated human
behavior. The study introduces a text-based game environment, named
AmongAgents, that mirrors the dynamics of Among Us. Players act as crew members
aboard a spaceship, tasked with identifying impostors who are sabotaging the
ship and eliminating the crew. Within this environment, the behavior of
simulated language agents is analyzed. The experiments involve diverse game
sequences featuring different configurations of Crewmates and Impostor
personality archetypes. Our work demonstrates that state-of-the-art large
language models (LLMs) can effectively grasp the game rules and make decisions
based on the current context. This work aims to promote further exploration of
LLMs in goal-oriented games with incomplete information and complex action
spaces, as these settings offer valuable opportunities to assess language model
performance in socially driven scenarios.
comment: Wordplay @ ACL 2024
♻ ☆ Description-Based Text Similarity
Identifying texts with a given semantics is central to many
information-seeking scenarios. Similarity search over vector embeddings appears
central to this ability, yet the similarity reflected in current text
embeddings is corpus-driven, and is inconsistent and sub-optimal for many use
cases. What, then, is a good notion of similarity for effective retrieval of
text?
We identify the need to search for texts based on abstract descriptions of
their content, and the corresponding notion of \emph{description based
similarity}. We demonstrate the inadequacy of current text embeddings and
propose an alternative model that significantly improves when used in standard
nearest neighbor search. The model is trained using positive and negative pairs
sourced by prompting an LLM, demonstrating how data from LLMs can be used
for creating new capabilities not immediately possible using the original
model.
comment: Accepted in COLM 2024
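The retrieval setting in which such embeddings are evaluated is standard nearest-neighbor search, sketched below with cosine similarity over toy vectors. This is a generic illustration, not the paper's model or data.

```python
import numpy as np

def nearest(query_vec, doc_vecs, k=2):
    """Cosine-similarity nearest-neighbor search over text embeddings: the
    standard retrieval setup in which description-based embeddings would be
    used. Generic sketch, not the paper's model."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]   # indices of the k most similar docs

# Toy embeddings: documents 0 and 2 point roughly along the query direction,
# so they are retrieved ahead of document 1.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
print(nearest(query, docs))  # [0 2]
```

The paper's contribution is what goes into the vectors (description-based training pairs), not the search itself; the search step stays exactly this simple.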
♻ ☆ Overview of AI-Debater 2023: The Challenges of Argument Generation Tasks
Jiayu Lin, Guanrong Chen, Bojun Jin, Chenyang Li, Shutong Jia, Wancong Lin, Yang Sun, Yuhang He, Caihua Yang, Jianzhu Bao, Jipeng Wu, Wen Su, Jinglu Chen, Xinyi Li, Tianyu Chen, Mingjie Han, Shuaiwen Du, Zijian Wang, Jiyin Li, Fuzhong Suo, Hao Wang, Nuanchen Lin, Xuanjing Huang, Changjian Jiang, RuiFeng Xu, Long Zhang, Jiuxin Cao, Ting Jin, Zhongyu Wei
In this paper we present the results of the AI-Debater 2023 Challenge held by
the Chinese Conference on Affect Computing (CCAC 2023), and introduce the
related datasets. We organize two tracks to handle the argumentative generation
tasks in different scenarios, namely, Counter-Argument Generation (Track 1) and
Claim-based Argument Generation (Track 2). Each track is equipped with its
distinct dataset and baseline model, respectively. In total, 32 competing teams
registered for the challenge, from which we received 11 successful submissions.
In this paper, we will present the results of the challenge and a summary of
the systems, highlighting commonalities and innovations among participating
systems. Datasets and baseline models of the AI-Debater 2023 Challenge have
been already released and can be accessed through the official website of the
challenge.
♻ ☆ Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
We introduce Q-Sparse, a simple yet effective approach to training
sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity
of activations in LLMs, which can bring significant efficiency gains at
inference. This is achieved by applying top-K sparsification to the activations
and the straight-through estimator during training. We also introduce Block
Q-Sparse for batch training and inference. The key results of this work are:
(1) Q-Sparse can achieve results comparable to those of baseline LLMs while
being much more efficient at inference time; (2) We present an
inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is
effective in different settings, including training-from-scratch,
continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for
both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the
synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the
cornerstone and a clear path to revolutionize the efficiency, including cost
and energy consumption, of future LLMs.
comment: Work in progress
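The top-K activation sparsification at the core of Q-Sparse can be sketched as a forward pass in a few lines. This is a schematic, not the authors' implementation; during training, gradients would flow through the zeroed entries via the straight-through estimator, which the forward-only sketch below does not implement.

```python
import numpy as np

def topk_sparsify(x, k):
    """Keep only the k largest-magnitude entries of each activation vector,
    zeroing the rest. Forward-pass sketch of Q-Sparse-style top-K
    sparsification; in training, the backward pass would treat the mask as
    identity (straight-through estimator). Not the authors' implementation.

    x: (batch, hidden) activations."""
    drop_idx = np.argsort(np.abs(x), axis=-1)[:, :-k]   # smallest-magnitude entries
    sparse = x.copy()
    np.put_along_axis(sparse, drop_idx, 0.0, axis=-1)
    return sparse

x = np.array([[0.1, -2.0, 0.5, 3.0],
              [1.0,  0.2, -0.3, 0.0]])
print(topk_sparsify(x, k=2))
# row 0 keeps -2.0 and 3.0; row 1 keeps 1.0 and -0.3
```

With only k of the hidden entries nonzero, the subsequent matrix multiply can skip the zeroed columns, which is where the inference efficiency gains come from.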
♻ ☆ Large Language Models as Topological Structure Enhancers for Text-Attributed Graphs
The latest advancements in large language models (LLMs) have revolutionized
the field of natural language processing (NLP). Inspired by the success of LLMs
in NLP tasks, some recent work has begun investigating the potential of
applying LLMs in graph learning tasks. However, most of the existing work
focuses on utilizing LLMs as powerful node feature augmenters, leaving
employing LLMs to enhance graph topological structures an understudied problem.
In this work, we explore how to leverage the information retrieval and text
generation capabilities of LLMs to refine/enhance the topological structure of
text-attributed graphs (TAGs) under the node classification setting. First, we
propose using LLMs to help remove unreliable edges and add reliable ones in the
TAG. Specifically, we first let the LLM output the semantic similarity between
node attributes through delicate prompt designs, and then perform edge deletion
and edge addition based on the similarity. Second, we propose using
pseudo-labels generated by the LLM to improve graph topology, that is, we
introduce the pseudo-label propagation as a regularization to guide the graph
neural network (GNN) in learning proper edge weights. Finally, we incorporate
the two aforementioned LLM-based methods for graph topological refinement into
the process of GNN training, and perform extensive experiments on four
real-world datasets. The experimental results demonstrate the effectiveness of
LLM-based graph topology refinement (achieving a 0.15%--2.47% performance gain
on public benchmarks).
comment: 10 pages
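The edge deletion and addition step described in this abstract reduces to thresholding a similarity score. The sketch below uses hypothetical thresholds and a precomputed similarity dictionary; in the paper, similarities come from prompted LLM judgments rather than a lookup table.

```python
def refine_edges(edges, candidates, sim, t_del=0.2, t_add=0.8):
    """Drop existing edges whose endpoint similarity falls below t_del, and
    add candidate edges whose similarity exceeds t_add. Thresholds and the
    `sim` table are hypothetical; the paper obtains similarities by prompting
    an LLM over node attributes."""
    kept = [e for e in edges if sim[e] >= t_del]
    added = [e for e in candidates if sim[e] > t_add]
    return kept + added

edges = [("a", "b"), ("b", "c")]
candidates = [("a", "c"), ("b", "d")]
sim = {("a", "b"): 0.9, ("b", "c"): 0.1, ("a", "c"): 0.95, ("b", "d"): 0.5}
print(refine_edges(edges, candidates, sim))
# ("b", "c") is deleted as unreliable; ("a", "c") is added as reliable
```

The refined edge list then feeds GNN training, alongside the pseudo-label propagation regularizer the abstract mentions.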
♻ ☆ Arrows of Time for Large Language Models
We study the probabilistic modeling performed by Autoregressive Large
Language Models (LLMs) through the angle of time directionality, addressing a
question first raised in (Shannon, 1951). For large enough models, we
empirically find a time asymmetry in their ability to learn natural language: a
difference in the average log-perplexity when trying to predict the next token
versus when trying to predict the previous one. This difference is at the same
time subtle and very consistent across various modalities (language, model
size, training time, ...). Theoretically, this is surprising: from an
information-theoretic point of view, there should be no such difference. We
provide a theoretical framework to explain how such an asymmetry can appear
from sparsity and computational complexity considerations, and outline a number
of perspectives opened by our results.
comment: Corrected typos in Table 2. Added links. 12 figures, 20 pages
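The quantity compared in this paper, average next-token log-perplexity in the forward versus the backward direction, can be illustrated with a toy bigram model. The paper measures this with trained LLMs; the bigram version below only shows what is being computed, and the toy corpus is my own.

```python
import math
from collections import Counter

def avg_neg_logprob(tokens, reverse=False):
    """Average next-token negative log-probability under an MLE bigram model,
    in the forward or backward direction. Toy illustration of the quantity
    compared in the paper (there measured with trained LLMs, not bigrams)."""
    seq = tokens[::-1] if reverse else tokens
    pair_counts = Counter(zip(seq, seq[1:]))
    ctx_counts = Counter(seq[:-1])   # times each token appears as a context
    total = sum(math.log(pair_counts[p] / ctx_counts[p[0]])
                for p in zip(seq, seq[1:]))
    return -total / (len(seq) - 1)

corpus = "the cat sat on the mat".split()
fwd = avg_neg_logprob(corpus)
bwd = avg_neg_logprob(corpus, reverse=True)
print(round(fwd, 4), round(bwd, 4))
# Forward and backward values differ even on this toy corpus: "the" has two
# possible successors but every token here has a unique predecessor.
```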
♻ ☆ Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis ICML 2024
State-of-the-art LLMs often rely on scale with high computational costs,
which has sparked a research agenda to reduce parameter counts and costs
without significantly impacting performance. Our study focuses on
Transformer-based LLMs, specifically applying low-rank parametrization to the
computationally intensive feedforward networks (FFNs), which are less studied
than attention blocks. In contrast to previous works, (i) we explore low-rank
parametrization at scale, up to 1.3B parameters; (ii) within Transformer
language models rather than convolutional architectures; and (iii) training
from scratch. Experiments on the large RefinedWeb dataset show
that low-rank parametrization is both efficient (e.g., 2.6$\times$ FFN speed-up
with 32\% parameters) and effective during training. Interestingly, these
structured FFNs exhibit steeper scaling curves than the original models.
Motivated by this finding, we develop wide and structured networks that
surpass current medium- and large-sized Transformers in perplexity and
throughput. Our code is available at
https://github.com/CLAIRE-Labo/StructuredFFN/tree/main.
comment: Accepted by ICML 2024 Next Generation of Sequence Modeling
Architectures Workshop. Short version of arXiv:2406.16450
♻ ☆ Tree-Planner: Efficient Close-loop Task Planning with Large Language Models ICLR 2024
Mengkang Hu, Yao Mu, Xinmiao Yu, Mingyu Ding, Shiguang Wu, Wenqi Shao, Qiguang Chen, Bin Wang, Yu Qiao, Ping Luo
This paper studies closed-loop task planning, which refers to the process of
generating a sequence of skills (a plan) to accomplish a specific goal while
adapting the plan based on real-time observations. Recently, prompting Large
Language Models (LLMs) to generate actions iteratively has become a prevalent
paradigm due to its superior performance and user-friendliness. However, this
paradigm is plagued by two inefficiencies: high token consumption and redundant
error correction, both of which hinder its scalability for large-scale testing
and applications. To address these issues, we propose Tree-Planner, which
reframes task planning with LLMs into three distinct phases: plan sampling,
action tree construction, and grounded deciding. Tree-Planner starts by using
an LLM to sample a set of potential plans before execution, then aggregating
them into an action tree. Finally, the LLM performs a
top-down decision-making process on the tree, taking into account real-time
environmental information. Experiments show that Tree-Planner achieves
state-of-the-art performance while maintaining high efficiency. By decomposing
LLM queries into a single plan-sampling call and multiple grounded-deciding
calls, a considerable part of the prompt is no longer repeatedly consumed. As a
result, token consumption is reduced by 92.2% compared to the
previously best-performing model. Additionally, by enabling backtracking on the
action tree as needed, the correction process becomes more flexible, leading to
a 40.5% decrease in error corrections.
comment: Published in ICLR 2024
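The action-tree construction phase amounts to merging sampled plans by their shared prefixes. The sketch below builds such a tree as a nested dict; the skill names are made up, and grounded deciding (walking the tree top-down given observations) is not shown.

```python
def build_action_tree(plans):
    """Aggregate sampled plans (lists of skill strings) into a nested-dict
    prefix tree, so action sequences sharing a prefix are represented once.
    A structural sketch of Tree-Planner's action tree; grounded deciding
    would then walk this tree top-down using real-time observations."""
    tree = {}
    for plan in plans:
        node = tree
        for action in plan:
            node = node.setdefault(action, {})
    return tree

# Hypothetical sampled plans for "get a drink": two share the fridge prefix.
plans = [
    ["open_fridge", "grab_milk", "close_fridge"],
    ["open_fridge", "grab_juice", "close_fridge"],
    ["walk_to_table", "grab_cup"],
]
tree = build_action_tree(plans)
print(sorted(tree))                 # ['open_fridge', 'walk_to_table']
print(sorted(tree["open_fridge"]))  # ['grab_juice', 'grab_milk']
```

Because shared prefixes collapse into one branch, the LLM only has to decide at branch points, which is what cuts the repeated prompt consumption the abstract reports.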
♻ ☆ A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, Can Huang
Recently, many studies have demonstrated that exclusively incorporating
OCR-derived text and spatial layouts with large language models (LLMs) can be
highly effective for document understanding tasks. However, existing methods
that integrate spatial layouts with text have limitations, such as producing
overly long text sequences or failing to fully leverage the autoregressive
traits of LLMs. In this work, we introduce Interleaving Layout and Text in a
Large Language Model (LayTextLLM) for document understanding. In particular,
LayTextLLM projects each bounding box to a single embedding and interleaves it
with text, efficiently avoiding long sequence issues while leveraging
autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction
of layout and textual data but also shows enhanced performance in Key
Information Extraction (KIE) and Visual Question Answering (VQA). Comprehensive
benchmark evaluations reveal significant improvements, with a 27.2% increase on
KIE tasks and 12.0% on VQA tasks compared to previous state-of-the-art document
understanding MLLMs, as well as a 15.1% improvement over other SOTA OCR-based
LLMs on KIE tasks.
♻ ☆ Efficient Tuning and Inference for Large Language Models on Textual Graphs IJCAI2024
The rich textual and topological information of textual graphs needs to be
modeled in real-world applications such as webpages, e-commerce, and academic
articles. Practitioners have long followed the path of adopting a shallow text
encoder and a subsequent graph neural network (GNN) to solve this problem. In
light of recent advancements in large language models (LLMs), it is apparent
that integrating LLMs for enhanced textual encoding can substantially improve
the performance of textual graphs. Nevertheless, the efficiency of these
methods poses a significant challenge. In this paper, we propose ENGINE, a
parameter- and memory-efficient fine-tuning method for textual graphs with an
LLM encoder. The key insight is to combine the LLMs and GNNs through a tunable
side structure, which significantly reduces the training complexity without
impairing the joint model's capacity. Extensive experiments on textual graphs
demonstrate our method's effectiveness by achieving the best model performance,
while having the lowest training cost compared to previous methods.
Moreover, we introduce two variants with caching and dynamic early exit to
further enhance training and inference speed. Specifically, caching accelerates
ENGINE's training by 12x, and dynamic early exit achieves up to 5x faster
inference with a negligible performance drop (at most a 1.17% relative drop
across 7 datasets). Our codes are available at:
https://github.com/ZhuYun97/ENGINE
comment: Accepted by IJCAI2024
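The side-structure idea in the ENGINE abstract, a frozen LLM whose per-layer hidden states feed a small tunable network, can be sketched roughly as below. This is a hypothetical NumPy toy: the layer shapes, tanh encoder, and additive combination are assumptions, and a real ENGINE layer would run a GNN over graph neighbors rather than a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Stand-in for a frozen LLM: each "layer" is a fixed random transform.
frozen_layers = [rng.normal(size=(DIM, DIM)) / DIM for _ in range(3)]

def llm_hidden_states(x):
    """Run the frozen encoder once and return every layer's output."""
    states = []
    for W in frozen_layers:
        x = np.tanh(x @ W)
        states.append(x)
    return states

# Tunable side structure: one tiny adapter per layer, combined additively.
# Only these weights would receive gradients during fine-tuning.
side_adapters = [rng.normal(size=(DIM, DIM)) * 0.01 for _ in range(3)]

def side_forward(states):
    """Combine the frozen layer outputs through the trainable side path."""
    h = np.zeros(DIM)
    for s, A in zip(states, side_adapters):
        h = h + s @ A  # a GNN message-passing step would slot in here
    return h

x = rng.normal(size=DIM)
cached = llm_hidden_states(x)  # compute once, reuse across epochs
out = side_forward(cached)
print(out.shape)  # (8,)
```

Because the frozen encoder's outputs never change, `llm_hidden_states` can be computed once per node and cached, which is the source of the training speedup the abstract reports.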
♻ ☆ Learning a Patent-Informed Biomedical Knowledge Graph Reveals Technological Potential of Drug Repositioning Candidates
Drug repositioning-a promising strategy for discovering new therapeutic uses
for existing drugs-has been increasingly explored in the computational science
literature using biomedical databases. However, the technological potential of
drug repositioning candidates has often been overlooked. This study presents a
novel protocol to comprehensively analyse various sources such as
pharmaceutical patents and biomedical databases, and identify drug
repositioning candidates with both technological potential and scientific
evidence. To this end, first, we constructed a scientific biomedical knowledge
graph (s-BKG) comprising relationships between drugs, diseases, and genes
derived from biomedical databases. Our protocol involves identifying drugs that
exhibit limited association with the target disease but are closely located in
the s-BKG, as potential drug candidates. We constructed a patent-informed
biomedical knowledge graph (p-BKG) by adding pharmaceutical patent information.
Finally, we developed a graph embedding protocol to ascertain the structure of
the p-BKG, thereby calculating the relevance scores of those candidates with
target disease-related patents to evaluate their technological potential. Our
case study on Alzheimer's disease demonstrates its efficacy and feasibility,
while the quantitative outcomes and systematic methods are expected to bridge
the gap between computational discoveries and successful market applications in
drug repositioning research.
comment: We withdraw this paper because we found critical errors in the
introduction and results sections. Specifically, the first author wrongly
inserted citations for background works and made mistakes in the graph
embedding methods, so the related results are wrongly calculated. We have
therefore withdrawn the current version. Thank you
♻ ☆ Multimodal Detection of Bots on X (Twitter) using Transformers
Although not all bots are malicious, the vast majority of them are
responsible for spreading misinformation and manipulating public opinion on
issues such as elections. Therefore, the early detection of bots is crucial.
Although methods have been proposed for detecting bots in social media,
substantial limitations remain. For instance, existing research initiatives
still extract a large number of features and train traditional machine
learning algorithms, or use GloVe embeddings and train LSTMs. However, feature
extraction is a tedious procedure demanding domain expertise, and
transformer-based language models have proved superior to LSTMs. Other
approaches build large graphs and train graph neural networks, requiring many
hours of training and access to computational resources. To tackle these
limitations, this is the first study employing only the user description field
and three-channel images denoting the type and content of the tweets posted by
the users. Firstly,
we create digital DNA sequences, transform them into 3D images, and apply
pretrained models of the vision domain, including EfficientNet, AlexNet, VGG16,
etc. Next, we propose a multimodal approach, where we use TwHIN-BERT for
getting the textual representation of the user description field and employ
VGG16 for acquiring the visual representation for the image modality. We
propose three different fusion methods, namely concatenation, gated multimodal
unit, and crossmodal attention, for fusing the different modalities and compare
their performances. Finally, we present a qualitative analysis of the behavior
of our best performing model. Extensive experiments conducted on the Cresci'17
and TwiBot-20 datasets demonstrate valuable advantages of our introduced
approaches over state-of-the-art ones.
comment: IEEE Transactions on Information Forensics and Security (Accepted)
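The "digital DNA to image" step can be illustrated with a toy encoding: one character per post by post type, laid out row-major into an RGB image that a pretrained vision model can consume. The alphabet, color mapping, and image size here are assumptions for illustration; the paper's actual encoding may differ.

```python
import numpy as np

# Hypothetical encoding: one base per post, chosen by post type.
DNA_ALPHABET = {"tweet": "A", "reply": "C", "retweet": "T"}

# Assumed channel values per base; the real mapping may differ.
BASE_TO_RGB = {"A": (255, 0, 0), "C": (0, 255, 0), "T": (0, 0, 255)}

def posts_to_dna(posts):
    """Concatenate one character per post into a 'digital DNA' string."""
    return "".join(DNA_ALPHABET[p] for p in posts)

def dna_to_image(dna, size=8):
    """Lay the sequence out row-major into a size x size RGB image,
    padding unused pixels with black."""
    img = np.zeros((size, size, 3), dtype=np.uint8)
    for i, base in enumerate(dna[: size * size]):
        img[i // size, i % size] = BASE_TO_RGB[base]
    return img

posts = ["tweet", "reply", "retweet", "tweet"]
dna = posts_to_dna(posts)
image = dna_to_image(dna)
print(dna, image.shape)  # ACTA (8, 8, 3)
```

The resulting array has the three channels that off-the-shelf vision backbones such as EfficientNet or VGG16 expect, after resizing to their input resolution.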
♻ ☆ Building Intelligence Identification System via Large Language Model Watermarking: A Survey and Beyond
Xuhong Wang, Haoyu Jiang, Yi Yu, Jingru Yu, Yilun Lin, Ping Yi, Yingchun Wang, Yu Qiao, Li Li, Fei-Yue Wang
Large Language Models (LLMs) are increasingly integrated into diverse
industries, posing substantial security risks due to unauthorized replication
and misuse. To mitigate these concerns, robust identification mechanisms are
widely acknowledged as an effective strategy. Identification systems for LLMs
now rely heavily on watermarking technology to manage and protect intellectual
property and ensure data security. However, previous studies have primarily
concentrated on the basic principles of algorithms and lacked a comprehensive
analysis of watermarking theory and practice from the perspective of
intelligent identification. To bridge this gap, firstly, we explore how a
robust identity recognition system can be effectively implemented and managed
within LLMs by various participants using watermarking technology. Secondly, we
propose a mathematical framework based on mutual information theory, which
systematizes the identification process to achieve more precise and customized
watermarking. Additionally, we present a comprehensive evaluation of
performance metrics for LLM watermarking, reflecting participant preferences
and advancing discussions on its identification applications. Lastly, we
outline the existing challenges in current watermarking technologies and
theoretical frameworks, and provide directional guidance to address these
challenges. Our systematic classification and detailed exposition aim to
enhance the comparison and evaluation of various methods, fostering further
research and development toward a transparent, secure, and equitable LLM
ecosystem.
comment: 59 pages, 7 figures
♻ ☆ Artificial Agency and Large Language Models
The arrival of Large Language Models (LLMs) has stirred up philosophical
debates about the possibility of realizing agency in an artificial manner. In
this work we contribute to the debate by presenting a theoretical model that
can be used as a threshold conception for artificial agents. The model defines
agents as systems whose actions and goals are always influenced by a dynamic
framework of factors that consists of the agent's accessible history, its
adaptive repertoire and its external environment. This framework, in turn, is
influenced by the actions that the agent takes and the goals that it forms. We
show with the help of the model that state-of-the-art LLMs are not agents yet,
but that there are elements to them that suggest a way forward. The paper
argues that a combination of the agent architecture presented in Park et al.
(2023) together with the use of modules like the Coscientist in Boiko et al.
(2023) could potentially be a way to realize agency in an artificial manner. We
end the paper by reflecting on the obstacles one might face in building such an
artificial agent and by presenting possible directions for future research.
comment: Accepted for publication in journal Intellectica, special issue
"Philosophies of AI: thinking and writing with LLMs" (Intellectica, issue 81)
♻ ☆ RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models ACL 2024
The application scope of large language models (LLMs) is increasingly
expanding. In practical use, users might provide feedback based on the model's
output, hoping for a responsive model that can complete responses according to
their feedback. Whether the model can appropriately respond to users' refuting
feedback and consistently follow through with execution has not been thoroughly
analyzed. In light of this, this paper proposes a comprehensive benchmark,
RefuteBench, covering tasks such as question answering, machine translation,
and email writing. The evaluation aims to assess whether models can positively
accept feedback in the form of refuting instructions and whether they can
consistently adhere to user demands throughout the conversation. We conduct
evaluations on numerous LLMs and find that LLMs are stubborn, i.e., they cling
to their internal knowledge and often fail to comply with user
feedback. Additionally, as the length of the conversation increases, models
gradually forget the user's stated feedback and roll back to their own
responses. We further propose recall-and-repeat prompting as a simple and
effective way to enhance the model's responsiveness to feedback.
comment: ACL 2024 final version
♻ ☆ Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As
Eden Avnat, Michal Levy, Daniel Herstain, Elia Yanko, Daniel Ben Joya, Michal Tzuchman Katz, Dafna Eshel, Sahar Laros, Yael Dagan, Shahar Barami, Joseph Mermelstein, Shahar Ovadia, Noam Shomron, Varda Shalev, Raja-Elie E. Abdulnour
Clinical problem-solving requires processing of semantic medical knowledge
such as illness scripts and numerical medical knowledge of diagnostic tests for
evidence-based decision-making. As large language models (LLMs) show promising
results in many aspects of language-based clinical practice, their ability to
generate non-language evidence-based answers to clinical questions is
inherently limited by tokenization. Therefore, we evaluated LLMs' performance
on two question types: numeric (correlating findings) and semantic
(differentiating entities) while examining differences within and between LLMs
in medical aspects and comparing their performance to humans. To generate
straightforward multi-choice questions and answers (QAs) based on
evidence-based medicine (EBM), we used a comprehensive medical knowledge graph
(encompassing data from more than 50,000 peer-reviewed articles) and created the
"EBMQA". EBMQA contains 105,000 QAs labeled with medical and non-medical topics
and classified into numerical or semantic questions. We benchmarked this
dataset using more than 24,500 QAs on two state-of-the-art LLMs: Chat-GPT4 and
Claude3-Opus. We evaluated the LLMs' accuracy on semantic and numerical question
types and according to sub-labeled topics. For validation, six medical experts
were tested on 100 numerical EBMQA questions. We found that both LLMs excelled
more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical
QAs. However, both LLMs showed inter- and intra-model gaps across different
medical aspects and remained inferior to humans. Thus, their medical advice
should be treated with caution.
♻ ☆ MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection WWW 2024
The prevalence of fake news across various online sources has had a
significant influence on the public. Existing Chinese fake news detection
datasets are limited to news sourced solely from Weibo. However, fake news
originating from multiple sources exhibits diversity in various aspects,
including its content and social context. Methods trained on a single news
source can hardly be applied to real-world scenarios. Our pilot
experiment demonstrates that the F1 score of the state-of-the-art method that
learns from a large Chinese fake news detection dataset, Weibo-21, drops
significantly from 0.943 to 0.470 when the test data is changed to multi-source
news data, failing to identify more than one-third of the multi-source fake
news. To address this limitation, we constructed the first multi-source
benchmark dataset for Chinese fake news detection, termed MCFEND, which is
composed of news we collected from diverse sources such as social platforms,
messaging apps, and traditional online news outlets. Notably, such news has
been fact-checked by 14 authoritative fact-checking agencies worldwide. In
addition, various existing Chinese fake news detection methods are thoroughly
evaluated on our proposed dataset in cross-source, multi-source, and unseen
source ways. MCFEND, as a benchmark dataset, aims to advance Chinese fake news
detection approaches in real-world scenarios.
comment: Accepted by the ACM Web Conference 2024 (WWW 2024) oral, dataset
available: https://github.com/TrustworthyComp
♻ ☆ Probing the Decision Boundaries of In-context Learning in Large Language Models
In-context learning is a key paradigm in large language models (LLMs) that
enables them to generalize to new tasks and domains by simply prompting these
models with a few exemplars without explicit parameter updates. Many attempts
have been made to understand in-context learning in LLMs as a function of model
scale, pretraining data, and other factors. In this work, we propose a new
mechanism to probe and understand in-context learning from the lens of decision
boundaries for in-context binary classification. Decision boundaries are
straightforward to visualize and provide important information about the
qualitative behavior of the inductive biases of standard classifiers. To our
surprise, we find that the decision boundaries learned by current LLMs in
simple binary classification tasks are often irregular and non-smooth,
regardless of linear separability in the underlying task. This paper
investigates the factors influencing these decision boundaries and explores
methods to enhance their generalizability. We assess various approaches,
including training-free and fine-tuning methods for LLMs, the impact of model
architecture, and the effectiveness of active prompting techniques for
smoothing decision boundaries in a data-efficient manner. Our findings provide
a deeper understanding of in-context learning dynamics and offer practical
improvements for enhancing robustness and generalizability of in-context
learning.
comment: 18 pages, code at https://github.com/siyan-zhao/ICL_decision_boundary
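The probing setup, querying a classifier on every point of a 2D grid and visualizing the resulting label map, can be sketched with a stand-in for the LLM. The mock classifier and its wobbly linear rule are hypothetical; a real probe would format each grid point into a prompt with in-context exemplars and parse the LLM's predicted label.

```python
import numpy as np

def mock_icl_classifier(point):
    """Stand-in for an LLM prompted with in-context examples.
    Labels points by a noisy linear rule to mimic the irregular,
    non-smooth boundaries the abstract describes."""
    x, y = point
    wobble = 0.3 * np.sin(7 * x)  # irregularity a smooth classifier lacks
    return int(y > x + wobble)

def probe_decision_boundary(classify, lo=-1.0, hi=1.0, n=50):
    """Query the classifier on an n x n grid and return the label map;
    plotting this map reveals the decision boundary."""
    xs = np.linspace(lo, hi, n)
    grid = np.array([[classify((x, y)) for x in xs] for y in xs])
    return grid

labels = probe_decision_boundary(mock_icl_classifier)
print(labels.shape)  # (50, 50)
```

Rendering `labels` with any heatmap tool then shows where the predicted class flips; with a real LLM, the cost of the probe is one model query per grid point.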
♻ ☆ The Honorific Effect: Exploring the Impact of Japanese Linguistic Formalities on AI-Generated Physics Explanations
This study investigates the influence of Japanese honorifics on the responses
of large language models (LLMs) when explaining the law of conservation of
momentum. We analyzed the outputs of six state-of-the-art AI models, including
variations of ChatGPT, Coral, and Gemini, using 14 different honorific forms.
Our findings reveal that honorifics significantly affect the quality,
consistency, and formality of AI-generated responses, demonstrating LLMs'
ability to interpret and adapt to social context cues embedded in language.
Notable variations were observed across different models, with some emphasizing
historical context and derivations, while others focused on intuitive
explanations. The study highlights the potential for using honorifics to adjust
the depth and complexity of AI-generated explanations in educational contexts.
Furthermore, the responsiveness of AI models to cultural linguistic elements
underscores the importance of considering cultural factors in AI development
for educational applications. These results open new avenues for research in
AI-assisted education and cultural adaptation in AI systems, with significant
implications for personalizing learning experiences and developing culturally
sensitive AI tools for global education.
♻ ☆ Video Understanding with Large Language Models: A Survey
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
With the burgeoning growth of online video platforms and the escalating
volume of video content, the demand for proficient video understanding tools
has intensified markedly. Given the remarkable capabilities of large language
models (LLMs) in language and multimodal tasks, this survey provides a detailed
overview of recent advancements in video understanding that harness the power
of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly
advanced, particularly their ability for open-ended multi-granularity (general,
temporal, and spatiotemporal) reasoning combined with commonsense knowledge,
suggesting a promising path for future video understanding. We examine the
unique characteristics and capabilities of Vid-LLMs, categorizing the
approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM,
and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based
on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as
Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this
survey presents a comprehensive study of the tasks, datasets, benchmarks, and
evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive
applications of Vid-LLMs across various domains, highlighting their remarkable
scalability and versatility in real-world video understanding challenges.
Finally, it summarizes the limitations of existing Vid-LLMs and outlines
directions for future research. For more information, readers are recommended
to visit the repository at
https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
♻ ☆ A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks
Large language models (LLMs) have shown remarkable performance on many
different Natural Language Processing (NLP) tasks. Prompt engineering plays a
key role in augmenting the existing abilities of LLMs to achieve
significant performance gains on various NLP tasks. Prompt engineering requires
composing natural language instructions called prompts to elicit knowledge from
LLMs in a structured way. Unlike previous state-of-the-art (SoTA) models,
prompt engineering does not require extensive parameter re-training or
fine-tuning based on the given NLP task and thus solely operates on the
embedded knowledge of LLMs. Additionally, LLM enthusiasts can intelligently
extract LLMs' knowledge through a basic natural language conversational
exchange or prompt engineering, allowing more people, even those without a
deep mathematical machine learning background, to experiment with LLMs. With prompt
engineering gaining popularity in the last two years, researchers have come up
with numerous engineering techniques around designing prompts to improve
accuracy of information extraction from the LLMs. In this paper, we summarize
different prompting techniques and group them together based on the different NLP
tasks that they have been used for. We further granularly highlight the
performance of these prompting strategies on various datasets belonging to that
NLP task, talk about the corresponding LLMs used, present a taxonomy diagram
and discuss the possible SoTA for specific datasets. In total, we survey 44
research papers covering 39 different prompting methods on 29 different NLP
tasks, most of which have been published in the last two years.
♻ ☆ LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model ECCV24
The distribution of subpopulations is an important property hidden within a
dataset. Uncovering and analyzing the subpopulation distribution within
datasets provides a comprehensive understanding of the datasets, standing as a
powerful tool beneficial to various downstream tasks, including Dataset
Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite
its importance, there has been no work that systematically explores the
subpopulation distribution of datasets to our knowledge. To address the
limitation and solve all the mentioned tasks in a unified way, we introduce a
novel concept of subpopulation structures to represent, analyze, and utilize
subpopulation distributions within datasets. To characterize the structures in
an interpretable manner, we propose the Subpopulation Structure Discovery with
Large Language Models (SSD-LLM) framework, which employs world knowledge and
instruction-following capabilities of Large Language Models (LLMs) to
linguistically analyze informative image captions and summarize the structures.
Furthermore, we propose complete workflows to address downstream tasks, named
Task-specific Tuning, showcasing the application of the discovered structure to
a spectrum of subpopulation-related tasks, including dataset subpopulation
organization, subpopulation shift, and slice discovery.
comment: ECCV24 Camera Ready
♻ ☆ CHATATC: Large Language Model-Driven Conversational Agents for Supporting Strategic Air Traffic Flow Management ICRAT
Generative artificial intelligence (AI) and large language models (LLMs) have
gained rapid popularity through publicly available tools such as ChatGPT. The
adoption of LLMs for personal and professional use is fueled by the natural
interactions between human users and computer applications such as ChatGPT,
along with powerful summarization and text generation capabilities. Given the
widespread use of such generative AI tools, in this work we investigate how
these tools can be deployed in a non-safety critical, strategic traffic flow
management setting. Specifically, we train an LLM, CHATATC, based on a large
historical data set of Ground Delay Program (GDP) issuances, spanning 2000-2023
and consisting of over 80,000 GDP implementations, revisions, and
cancellations. We test the query and response capabilities of CHATATC,
documenting successes (e.g., providing correct GDP rates, durations, and
reasons) and shortcomings (e.g., superlative questions). We also detail the
design of a graphical user interface for future users to interact and
collaborate with the CHATATC conversational agent.
comment: 8 pages, 5 figures; minor revisions to address reviewer feedback for
final submission to the 11th International Conference on Research in Air
Transportation (ICRAT)
♻ ☆ Multi-Convformer: Extending Conformer with Multiple Convolution Kernels INTERSPEECH 2024
Convolutions have become essential in state-of-the-art end-to-end Automatic
Speech Recognition~(ASR) systems due to their efficient modelling of local
context. Notably, their use in Conformers has led to superior performance
compared to vanilla Transformer-based ASR systems. While components other than
the convolution module in the Conformer have been reexamined, altering the
convolution module itself has been far less explored. Towards this, we
introduce Multi-Convformer that uses multiple convolution kernels within the
convolution module of the Conformer in conjunction with gating. This helps in
improved modeling of local dependencies at varying granularities. Our model
rivals existing Conformer variants such as CgMLP and E-Branchformer in
performance, while being more parameter efficient. We empirically compare our
approach with Conformer and its variants across four different datasets and
three different modelling paradigms and show up to 8% relative word error
rate~(WER) improvements.
comment: Accepted to INTERSPEECH 2024
♻ ☆ Two-stage Generative Question Answering on Temporal Knowledge Graph Using Large Language Models ACL
Temporal knowledge graph question answering (TKGQA) poses a significant
challenge, due to the temporal constraints hidden in questions and the
answers sought from dynamic structured knowledge. Although large language
models (LLMs) have made considerable progress in their reasoning ability over
structured data, their application to the TKGQA task is a relatively unexplored
area. This paper first proposes a novel generative temporal knowledge graph
question answering framework, GenTKGQA, which guides LLMs to answer temporal
questions through two phases: Subgraph Retrieval and Answer Generation. First,
we exploit the LLM's intrinsic knowledge to mine temporal constraints and
structural links in the questions without extra training, thus narrowing down
the subgraph search space in both temporal and structural dimensions. Next, we
design virtual knowledge indicators to fuse the graph neural network signals of
the subgraph and the text representations of the LLM in a non-shallow way,
which helps the open-source LLM deeply understand the temporal order and
structural dependencies among the retrieved facts through instruction tuning.
Experimental results on two widely used datasets demonstrate the superiority of
our model.
comment: Accepted by ACL(Findings) 2024
♻ ☆ PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation ECCV 2024
Text-to-image diffusion models are well-known for their ability to generate
realistic images based on textual prompts. However, the existing works have
predominantly focused on English, lacking support for non-English text-to-image
models. The most commonly used translation methods cannot solve the generation
problem related to language culture, while training from scratch on a specific
language dataset is prohibitively expensive. In this paper, we propose a
simple plug-and-play language transfer method based on knowledge
distillation. All we need to do is train a lightweight MLP-like
parameter-efficient adapter (PEA) with only 6M parameters under teacher
knowledge distillation along with a small parallel data corpus. We are
surprised to find that freezing the parameters of UNet can still achieve
remarkable performance on the language-specific prompt evaluation set,
demonstrating that PEA can stimulate the potential generation ability of the
original UNet. Additionally, it closely approaches the performance of the
English text-to-image model on a general prompt evaluation set. Furthermore,
our adapter can be used as a plugin to achieve significant results in
downstream tasks in cross-lingual text-to-image generation. Code will be
available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
comment: ECCV 2024
♻ ☆ Tailoring Vaccine Messaging with Common-Ground Opinions NAACL
Rickard Stureborg, Sanxing Chen, Ruoyu Xie, Aayushi Patel, Christopher Li, Chloe Qinyu Zhu, Tingnan Hu, Jun Yang, Bhuwan Dhingra
One way to personalize chatbot interactions is by establishing common ground
with the intended reader. A domain where establishing mutual understanding
could be particularly impactful is vaccine concerns and misinformation. Vaccine
interventions are forms of messaging which aim to answer concerns expressed
about vaccination. Tailoring responses in this domain is difficult, since
opinions often have seemingly little ideological overlap. We define the task of
tailoring vaccine interventions to a Common-Ground Opinion (CGO). Tailoring
responses to a CGO involves meaningfully improving the answer by relating it to
an opinion or belief the reader holds. In this paper we introduce TAILOR-CGO, a
dataset for evaluating how well responses are tailored to provided CGOs. We
benchmark several major LLMs on this task, finding that GPT-4-Turbo performs
significantly better than others. We also build automatic evaluation metrics,
including an efficient and accurate BERT model that outperforms finetuned LLMs,
investigate how to successfully tailor vaccine messaging to CGOs, and provide
actionable recommendations from this investigation.
Code and model weights: https://github.com/rickardstureborg/tailor-cgo
Dataset: https://huggingface.co/datasets/DukeNLP/tailor-cgo
comment: NAACL Findings 2024